Abstract

We address the problem of message ordering for reliable
multicast communication. End-to-end multicast ordering is
useful for ensuring the collective integrity and consistency
of distributed operations. It is applicable to distributed
multiparty collaboration and other multipoint applications,
where the ordered reception of messages at all hosts is critical.

Existing reliable multicast protocols largely lack support
for ordering. Our novel mechanism can be added to existing
reliable multicast services at low cost by performing
cascaded total ordering of messages among on-tree hosts
en route from senders to receivers. The protocol operates
directly on a given end-to-end multicast tree, in contrast to
other tree-based approaches, which require a separate
propagation graph to be built to compute ordering information. For
better load distribution, resilience, and ordered subcasting
of messages within multicast groups, sequencer nodes are
elected dynamically based on address extensions to hosts in
the multicast tree.

A taxonomy of broadcast and multicast ordering solutions
and a comparative cost analysis show that reliable message
delivery integrated with staggered ordering in end-to-end
multicast trees is more efficient, more scalable, and less
costly to deploy.
1  Introduction

IP multicast communication [10] generalizes the point-to-point
and broadcast communication model to multipoint
dissemination of messages. A source must send a packet
only once to the network interface, and packets are transparently
replicated on their transmission paths to the receivers.
This form of communication is indispensable for networked
applications with high-volume data transfer, such as
distributed software updates, newscasts, and video-on-demand,
and for interactive applications, for example distributed
simulations or telecollaboration systems. Data handled by
these applications fall into two categories: continuous media
streams and non-real-time data. Real-time data delivery,
e.g., for video or audio streams, is typically best-effort and
unordered, but must observe deadlines to be useful for an
application. Non-real-time packets carry discrete data, and
may require reliable, ordered delivery based on the application
semantics.

Changes in datagram routing or transmission errors may
cause packets to arrive at their destination out of sequence.
Disordered delivery of packets in a distributed application
may result in different views of the group state at end hosts.
Ordering of messages compensates for the lack of a global
system state and the effects of asynchrony, unpredictable
network delay, and disparities in host processing in
distributed communication. It ensures that destination
processes observe the same order of reception of messages.
Ordering is complemented by reliability and atomicity.
Reliability guarantees that messages eventually arrive correctly at
their destinations, and atomicity guarantees that a message
is received by all members of a multicast group or by none.

Consider a distributed interactive simulation with many
moving, interacting entities, where a message $m_1$ is reliably
multicast from source $s_1$ to receiver group $Rec_1$, and $m_2$ is
reliably multicast from $s_2$ to $Rec_2$. A host which belongs
to $Rec_1$ may receive message $m_1$ before $m_2$, while another
host belonging to both groups may receive the messages
in the opposite order. Correct operation of the simulation
system requires not only that the input stream is equivalent
for all replicas, but also that all input events are delivered to
replicated instances of shared applications in the same order.
Some ordering protocol must intercept, or better, be
integrated into the delivery process to guarantee such consistency.

The majority of existing reliable multicast solutions [23]
lack ordering services. A comparison of the performance
characteristics of such protocols [20], covering sender- or
receiver-initiated protocols, ring- or tree-based protocols,
and tree protocols with negative acknowledgments and periodic
polling, showed that the latter protocol type is the most
scalable and efficient approach known to date among
deployable systems. Based on these observations, our objective
is to examine how ordering services can be integrated
with reliable multicasting, in particular with tree-based
protocols, while preserving scalability and efficiency. We describe
a solution for this problem using the idea of staggered
ordering of messages on their delivery paths from sources to
receivers in the reliable multicast tree, which is also used
for logical connectivity between hosts for the purpose of error
recovery. In contrast to earlier work, our protocol does
not require construction of a separate logical propagation
graph or global clock synchronization, and ordering is
distributed across nodes on the delivery paths between sources
and receivers in the multicast tree.

The  paper  is  structured  as  follows:  Section  2  introduces 
relevant  terms  and  our  system  model.  Section  3  presents 
a  description  of  the  TOM  (Tree-based  Ordered  Multicast) 
protocol.  Section  4  introduces  a  taxonomy  for  ordering 
schemes  and  discusses  performance  figures  for  contending 
solutions,  making  the  case  for  tree-based  ordered  multicast. 
Section  5  reviews  related  work,  and  conclusions  are  offered 
in  Section  6. 

2  System  Model  and  Assumptions 

Our network model $(H, C)$ consists of a set of $n$ hosts $H$
and communication links $C$, communicating via message
passing in the absence of physical clock synchronization. A
host is equated with the processes running on it. A multicast
group is a set of $k$ hosts in a network of $n$ hosts, which is
addressable collectively by a unique group address.

Message dissemination is assumed to be genuine multicast,
i.e., a source sends a message $m$ once to the network
interface in a multicast-enabled backbone, which replicates
$m$ at multicast-enabled routers on its path to $r \le n$ receivers.
This stands in contrast to most prior work on ordered
multicasting, which assumes either unicast, where a message
must be sent $r$ times from a source to the network interface
to reach $r < n$ receivers, or broadcast, where all $n$
hosts in the network are addressed and designated receivers
must filter out messages targeted at them.

Four cases of group connectivity can be observed: 1)
from a single source $s$ to a single group $g$, denoted as $(s, g)$,
or 2) to multiple groups $G$, $(s, G)$, or from multiple sources
$S$ to 3) a single group, $(S, g)$, or 4) to multiple groups,
$(S, G)$. Cases 1) and 2) have a simple solution: sequence
numbers fixing the ordering relation are added to outgoing
messages at the source, and messages are delivered in that order
at the destinations. Cases 3) and 4) are more difficult to
implement, because sending messages from one host is independent
of other hosts, whereas reception of the same messages
may be interdependent and destination groups may
overlap. We are interested in totally ordered multicast from
multiple sources to multiple receivers or receiver groups.
We assume that hosts do not fail and network partitions do
not occur. Although the case is very specific, we consider
overlapping groups for our protocol, because they were also a
focal point in previous work on ordered multicast [13, 14, 16].
Hosts in the intersection of two overlapping multicast groups
should receive a message only once, if this message is sent to
both groups.

In total order, two messages $m_1$ and $m_2$ are delivered to all
hosts of a receiver set $Rec$ in the same relative order. For
example, if two sources, A and B, send messages $m_1$ and $m_2$ to
receiver groups $G_1$ and $G_2$, respectively, then hosts in both
groups, in particular in the intersection $G_1 \cap G_2$, should
receive both messages either in the order $(m_1, m_2)$ or in the
order $(m_2, m_1)$. Atomic order demands that either all or none
of the hosts in $Rec$ receive the messages. A weaker notion than
total order is causal order, based on Lamport's "happened before"
relation [19]. While a causal precedence relation between
two messages preserves their sending order at delivery time,
messages without causal linkage may still be delivered to
different hosts in different order. We assume that all logical
point-to-point channels between any pair of hosts are FIFO,
which prevents an earlier message from a process from being
overtaken in delivery by a later message from the same process.
If not provided by the network layer, FIFO delivery over
non-FIFO channels can be implemented by having the source
process add a sequence number to its messages and letting
destinations deliver according to these sequence numbers [4].
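
The sequence-number technique for FIFO delivery can be sketched as follows. This is a minimal illustration rather than part of the protocol specification; the class name `FifoReceiver` and the convention that sequence numbers start at 0 are assumptions made for the example.

```python
class FifoReceiver:
    """Per-source FIFO delivery over a channel that may reorder messages.

    Hypothetical helper: the name and the choice of 0 as the first
    sequence number are assumptions, not taken from the paper.
    """

    def __init__(self):
        self.next_seq = 0   # next sequence number that may be delivered
        self.pending = {}   # out-of-order messages, keyed by sequence number

    def receive(self, seq, payload):
        """Buffer one message and return the payloads that become deliverable."""
        if seq < self.next_seq or seq in self.pending:
            return []       # duplicate, already delivered or already buffered
        self.pending[seq] = payload
        delivered = []
        # Release the longest gap-free run starting at next_seq.
        while self.next_seq in self.pending:
            delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return delivered
```

A message that arrives early is held back until every message with a lower sequence number from the same source has been delivered, so a later message can never overtake an earlier one.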

Finally, we assume that a reliable, unordered multicast
protocol is running at every host, providing reliable delivery
of a message to all operational hosts in a target multicast
group. Ordered multicast should be host minimal, i.e., no
hosts other than the source and the receivers should be
affected by the multicast of a message, and message minimal, i.e.,
the message size is a function of the size of the receiver set
and not of an entire session or network [27]. Total order
multicast in a broadcast model, for instance, is not host minimal.
Looking at end-to-end debates [8, 28], we subscribe to
the view that ordering can be provided as middleware
complementing reliable multicasting, to encourage code reuse
and ease deployment, as shown in Figure 1. We justify
this approach based on the observation that many networked
multimedia applications are based on similar media
characteristics and delivery semantics. In contrast, applications
such as the MBone whiteboard tool [12] provide
application-level ordering of messages.


[Figure 1: diagram with boxes labeled application layer services;
TCP; ordered multicast; reliable multicast; IP unicast routing;
IP multicast routing; lower layer network services.]

Figure 1. Network protocol stack subsuming ordered multicast.


3  TOM  Protocol  Description 

The Tree-based Ordered Multicast (TOM) protocol relies
on an underlying reliable multicast tree for propagation
of ordering information besides acknowledgments and
retransmissions. This tree is assumed to approximate the
underlying multicast routing tree, which for the Internet
is built using various protocols such as DVMRP, CBT or
PIM-SM (cf. [15] for a general overview). For the following
description, we assume that hosts do not fail and network
partitions do not occur. Trees can be constructed per
source, which pays off only for long-lived or large-volume
transmissions, or dissemination can be based on a
shared tree, across which (negative) acknowledgments are
relayed between hosts. In such a tree, sources may change
frequently, only one collective infrastructure must be
maintained, and a source need not know the identity of all
receivers in the multicast group. However, the paths from
sources to receivers may be suboptimal.

For the description of the ordering mechanism, it is
unimportant which reliable multicast protocol is used. It
is also not crucial whether the end-to-end multicast tree is
source-based or shared, but we will exemplify how TOM
operates to provide total order in a shared tree. The key
idea in TOM is to multicast a message from a source to
a receiver set, combined with sending ordering information
for the message (sequence numbers or timestamps) to a
common node on the tree elected as ordering node for this
receiver set (or multicast group). The ordering node
sequences the messages assigned to it and multicasts binding
sequence numbers for final delivery to the receiver set, where
pending messages are to be delivered. TOM can be deployed
in the form of an API accessible to applications with
ordering needs.


3.1  Data  Structures 

A host in the multicast tree is either a source node (SN),
an extra node (EN), a primary node (PN), an ordering node
(ON), or a receiver node (RN). Since every host in the
multicast session runs the ordering protocol, roles are assumed
on the fly and no dedicated hardware is needed. SNs emit
messages to one or more multicast groups in a session. ENs
are nodes that are not members of the receiver set for
a message; they relay messages upward or downward in the
tree without participating in the ordering process. PNs are
hosts on the upward ordering path from SN to ON,
aggregating control messages in local order and forwarding
revised sequence numbers up in the tree. The ON is the
sequencer node for a message, gathering the sequence number
bids set en route by PNs, deciding on a globally valid number,
and multicasting the message to the receiver set with a
final and binding sequence number directive. Sources can
be ONs as well. RNs are message recipients, delivering messages
according to an ON-sanctioned sequence number. Nodes
can be SN for their own messages and assume all other roles
for other messages. Edges in the acknowledgment tree point
from child nodes to their parents.

A TOM message $m = (m^h, m^b)$ consists of a control
header $m^h$ and a body $m^b$, with

$m^h = (SN\_id, Rec, seq\#, ts, of)$

where $SN\_id$ is the source identifier, $Rec$ is the target
receiver set (which is either a multicast group, or a collection
of individual node identifiers), $seq\#$ is the sequence number
used for ordering, $ts$ is an optional timestamp for ordering
using timing information at nodes, and $of$ is the ordering
flag indicating that a binding sequence number for the
message has been set. $m^b$ contains the actual data stream.
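
As an illustration of this layout, the header and body could be represented as two small records. The sketch below is an assumption-laden rendering: the class and field names (`TomHeader`, `TomMessage`, `sn_id`, and so on) are hypothetical, and the convention that an empty body marks a control message follows the pseudocode of the protocol rather than any stated wire format.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass
class TomHeader:
    sn_id: str                  # source identifier (here: the label of the SN)
    rec: FrozenSet[str]         # target receiver set Rec, as node labels
    seq: int                    # seq#, provisional until the ordering flag is set
    ts: Optional[float] = None  # optional timestamp
    of: bool = False            # ordering flag: True once seq is binding


@dataclass
class TomMessage:
    header: TomHeader
    body: bytes = b""           # assumption: an empty body marks a control message

    def is_control(self) -> bool:
        return len(self.body) == 0
```

With this representation, the message multicast to the receiver set carries both parts, while the control message sent toward the ordering node carries the header only.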

Each node maintains two message windows for ordering:
a window for unordered messages (uw), which have
been received but whose delivery is pending, and an
ordered messages window (ow) for messages that are
correctly ordered and can be delivered to local processes. The
sizes of these buffers are limited by the number of hosts in
the largest multicast group known at the time of buffer
allocation. Each host programs its local network interface to
subscribe to multicast packets on the same local network, or
to receive packets from routers based on IGMP information
(cf. [15]).
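
The interplay of the two windows at a receiver can be sketched as below. The sketch makes several assumptions that are not stated in the text: binding sequence numbers are consecutive integers starting at `first_seq`, the data message reaches a receiver before its binding header, and messages are identified by a hypothetical `msg_id`. All names are illustrative.

```python
import heapq


class ReceiverWindows:
    """Unordered window (uw) and ordered window (ow) of one receiver node."""

    def __init__(self, first_seq=0):
        self.uw = {}               # msg_id -> body, received but not yet ordered
        self.ow = []               # min-heap of (binding seq#, msg_id, body)
        self.next_seq = first_seq  # assumption: binding numbers are consecutive

    def on_data(self, msg_id, body):
        """A data message arrived by reliable multicast; park it in uw."""
        self.uw[msg_id] = body

    def on_binding(self, msg_id, seq):
        """A header with the ordering flag set arrived; shift uw -> ow.

        Returns the bodies that can now be delivered to local processes.
        Assumes the data message for msg_id is already in uw.
        """
        heapq.heappush(self.ow, (seq, msg_id, self.uw.pop(msg_id)))
        deliverable = []
        while self.ow and self.ow[0][0] == self.next_seq:
            deliverable.append(heapq.heappop(self.ow)[2])
            self.next_seq += 1
        return deliverable
```

Two receivers that see the data messages in different arrival orders still deliver them identically, because delivery is driven only by the binding numbers.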

3.2  Operation 

TOM performs message ordering in four steps: 1) a message
multicast from each SN to receivers; 2) a control message
unicast from SN across PNs to the ON for the designated
multicast group or transmission, where PNs aggregate
messages from their subtrees and hence stagger the ordering
process upward in the tree; 3) determination of a binding
sequence number for this message and a multicast to
the receiver group; and 4) the delivery of messages at end
hosts according to the agreed-upon sequence numbers. The
goal is to deliver messages consistently in an order all hosts
agree to, without requiring sources to know the constituency
of the receiver set. Multicast group information is assumed
to be available from a session directory service.

To allow for selective addressing of hosts and dynamic
election of an ON, we introduce a labeling mechanism
known from multiprocessor routing and recently proposed
for reliable multicast in the tree-based protocol Lorax [20],
and for multicast routing. Labels allow for open ordered
multicast, i.e., addressing of specific nodes in the tree without
the need to manifest a separate multicast group or to reveal
IP addresses, and facilitate self-routing of messages
to their destinations based on prefix comparison. Each node
$i$ in the acknowledgment tree is labeled with a unique label
$l(i)$, which is the prefix of the labels of all children of $i$.
The label alphabet is a set of symbols with a defined order, such as
integers or letters with lexicographic order, with the alphabet
cardinality corresponding to the tree branching factor $B$.
The heuristic to select an ON is as follows: for each set of
messages destined to a particular multicast group or set of
hosts, elect as ON the node whose label is the longest common
prefix among all node labels in the receiver set. Each
ON gathers the sequence number bids set en route by PNs,
decides on a globally valid number, and multicasts the
respective message to the receiver set with a final and binding
sequence number directive. Figure 2 illustrates the mechanics
of TOM.

Figure 2. Ordered multicast on acknowledgment
tree using address labels (node labels are only
depicted if nodes are involved in transmission).

Node $r$, as the root of the tree, carries label 1. Node $d$
is the only child in this multicast session, carrying the prefix
of its parent $r$ concatenated with its own index 0. All
three sources of messages, nodes $x$, $y$ and $z$, have labels of
length 5, being positioned at depth 5 in the tree. The key
idea with using labels for the ordering procedure is to create
a confluence of messages at strategically optimal nodes
in the tree for ordering a number of messages arriving in
the same time window. Rather than depending on a statically
assigned ordering node, the ON is dynamically selected
per transmission as the node whose label is the longest common
prefix among the sources of pending messages in the targeted
multicast group, without the need to pass an election token
among nodes.

Consider the case that $x$, $y$ and $z$ want to multicast messages
to a multicast group $Rec = \{x, y, z, a, b, c, d, e, f\}$.
Each source multicasts its message to $Rec$, where it is entered
in the order of collective arrival into uw. Control
messages $m_x^h$ and $m_y^h$ are routed from SN $x$ and $y$,
respectively, across their parents to the first common prefix
node $c$, are intermediately ordered at $c$ and, with revised
sequence numbers, percolated up in the tree to node $d$, where
message header $m_z^h$ is also arriving. At any node on the
path, a bitmask operation on the matching prefix indicates
which messages must be up-routed or handled locally. At
$d$ it is determined that its label 10 matches the longest
common prefix of SN labels $l(x)$, $l(y)$ and $l(z)$. Hence,
$ON(m_x, m_y, m_z) = d$, and node $d$ sequences and multicasts
updated message headers to $Rec$ to signal that the associated
messages can be delivered. Once each receiver in
$Rec$ receives the ordering information per message $m$ with
$of = true$ from ON, it shifts $m$ into the ow, where the
heading element is delivered to end processes first.
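
The election rule used in this example can be sketched as a longest-common-prefix computation over labels. The concrete labels below are hypothetical: the text fixes only the root label 1, the label 10 of node $d$, and the label length 5 of the three sources, so the strings for $x$, $y$ and $z$ and the one-symbol-per-level encoding are assumptions chosen to be consistent with those facts.

```python
def longest_common_prefix(labels):
    """Return the longest string that is a prefix of every label."""
    labels = list(labels)
    if not labels:
        return ""
    prefix = labels[0]
    for label in labels[1:]:
        # Shorten the candidate until it prefixes the current label.
        while not label.startswith(prefix):
            prefix = prefix[:-1]
    return prefix


# Hypothetical labels: one symbol per tree level, root labeled "1".
label = {"x": "10000", "y": "10001", "z": "10110"}

first_meeting_point = longest_common_prefix([label["x"], label["y"]])
ordering_node = longest_common_prefix(label.values())
```

Under these assumed labels, `first_meeting_point` is the label of a node four levels deep (playing the role of node $c$), and `ordering_node` evaluates to `"10"`, the label of node $d$. No token passing is needed, since every node on the upward path can run the same comparison locally.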

Similarly, messages to a multicast group located in a left
subbranch of the acknowledgment tree can be handled locally
by the ON of that group, without affecting any nodes
in other segments of the tree. The only overhead incurred
in the ordering process is the control message unicast from
SNs to some ON, plus one multicast to the receiver set.
Total order is hence achieved in a diffusing computation,
where the ordering process is carried out along with the
message multicast, but receiver nodes are neither burdened
with sorting out messages, nor do they have to know the
identity of the ON. Through the percolation process from SN
to ON, usage of the same sequence number for a specific
message at all receivers in a multicast group is guaranteed.

Labels allow open ordered multicast, i.e., addressing of
specific nodes in the tree with an ordered message sequence
without the need to manifest a separate multicast group, and
self-routing of messages to their destinations based on
prefix comparison. Figure 3 specifies the ordering algorithm
TOM() that an on-tree host $i$ may use to send a message
$m$ totally ordered to a receiver set $Rec$ (hosts are assumed
to carry prefix labels). Procedure TOM_send() multicasts
a message to the receiver set and unicasts the control
header towards the dynamically elected ON; TOM_cast()
self-routes messages to a receiver based on prefix labels;
and TOM_receive() checks whether a node is EN, PN, ON,
or RN and acts accordingly:


proc TOM (node i)
Cobegin
    TOM_send(); TOM_receive()
Coend

proc TOM_send (message m^b, receivers Rec)
Begin  /* i is SN */
    If m^b ≠ ∅ And i = SN(m)
    Then m^h = (l(i), Rec, seq#, false)
         m = (m^h, m^b)
         reliable_multicast(m, Rec)
         TOM_cast(m^h, parent(i))
End

proc TOM_cast (message m, receiver rec)
Begin  /* self-route from node i to rec */
    If |l(i)| > |l(rec)|
    Orif l(i) ≠ prefix(l(rec))
    Then If ∃ parent(i)
         Then unicast(m) to parent(i)
    Else If ∃ children(i)
         Then unicast(m) to child(i)
              where l(child(i)) = prefix(l(rec))
End

proc TOM_receive (message m, receivers Rec)
Begin
    If i ∉ Rec(m)  /* i is EN */
    Then unicast(m) to parent(i);
    Elseif of(m) = false And m^b = ∅  /* i is PN */
    Then tag m with new seq#;
         TOM_cast(m, parent(i))
    Elseif l(i) ∈ Rec  /* i is RN */
    Then If m^b ≠ ∅ And of(m) = false
         Then insert(m^b, uw)
         Else shift m from uw to ow
              deliver(head(ow), local processes)
    Else compute longest common prefix lcp = (m, pm)
         If lcp = l(i)  /* i is ON */
         And (query parent(i) for pending msgs = ∅)
         Then Forall msgs in uw select seq#s
              shift msgs with set seq#s into ow
              TOM_cast(head(ow), Rec)
End


Figure 3. TOM procedures for ordered multicast
from host i to other hosts and for processing
received messages from other nodes for ordered
local delivery.
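
The self-routing decision of TOM_cast() can be illustrated with a short executable sketch. It assumes, beyond what the text states, that every tree level appends exactly one symbol to the label, so that a parent label is the child label without its last symbol; the function names `next_hop` and `route` are hypothetical.

```python
def next_hop(current, target):
    """Label of the neighbor to forward to, or None if current is the target.

    Assumption: one label symbol per tree level, so the parent of a node
    is obtained by dropping the last symbol of its label.
    """
    if current == target:
        return None
    if target.startswith(current):
        # Target lies in our subtree: descend to the child whose label
        # is the next-longer prefix of the target label.
        return target[: len(current) + 1]
    # Our label is longer than, or not a prefix of, the target: go up.
    return current[:-1]


def route(source, target):
    """Full hop-by-hop path obtained by repeated prefix comparison."""
    path = [source]
    hop = next_hop(source, target)
    while hop is not None:
        path.append(hop)
        hop = next_hop(hop, target)
    return path
```

For two hypothetical leaves `"10000"` and `"10110"`, the route climbs to their common ancestor `"10"` and then descends, which mirrors the upward percolation to the ordering node followed by delivery into the receiver subtree.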

Consider the special case of ordering with this mechanism
when messages must be sent to two different, but
overlapping multicast groups, e.g., $G_1 = \{a, b, c\}$ and
$G_2 = \{c, d, e, f\}$, with $G_1 \cap G_2 = \{c\}$. Nodes in each group
must receive a given message sequence in total order, and
node $c$ shall not receive contradictorily ordered messages.
This case can be solved if individual membership in target
groups is known. Instead of choosing the node with
the longest common prefix as ON, the nodes with multiple
membership will then be the ordering cores for a transmission,
prescribing their sequencing decisions to their respective
ON. In our case, $c$ will be instrumental in informing $d$
about the sequence in group $G_1$, such that $d$ can construct a
sequence from this that is compatible with $G_2$.

While total order of messages within one or more destination
multicast groups is ensured, causal order among
messages is not preserved in the above algorithm. To provide
causality, the sequence numbers of messages to be ordered
must encode causal dependency information before
reaching the ON. This can be achieved, for instance, by Lamport
clocks maintained by all nodes belonging to a multicast
group, and by updating sequence numbers in the staggered
ordering process to preserve the causal relations. To implement
atomicity in delivery, that is, either all RNs in $Rec(m)$
receive message $m$ or none, another message exchange between
all RNs and the ON must be introduced, where all RNs
signal reception of $m$ and its control header to the ON, and
the ON must send another ok_to_deliver(m) signal for the RNs to
collectively proceed with delivery.
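
The Lamport clock rule referred to above can be sketched as follows. This shows only the standard logical clock update; how TOM would fold such timestamps into its sequence number bids is not specified in the text and is not modeled here. The class and method names are illustrative.

```python
class LamportClock:
    """Logical clock implementing Lamport's "happened before" ordering."""

    def __init__(self):
        self.time = 0

    def tick(self):
        """Local event or send: advance the clock and return the timestamp."""
        self.time += 1
        return self.time

    def merge(self, received):
        """Receive event: move past both the local and the sender's time."""
        self.time = max(self.time, received) + 1
        return self.time
```

If host A sends a message stamped with `a.tick()` and host B calls `b.merge(...)` on receipt before sending its own message, B's next timestamp is strictly larger than A's, so a sequencer that respects these timestamps cannot order the causally later message first.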

Resilience is another important aspect in TOM operation
that we only briefly discuss for space reasons. Ordering can
be linked with several types of reliability [13], including 1)
giving no guarantee on the reliability of ordered deliveries,
2) assuming only inconsistent deliveries with failed hosts,
3) inciting roll-backs at operational hosts to repair
inconsistent deliveries, and 4) the assumption that inconsistencies
never happen. Furthermore, another set of choices addresses
the time it takes to deliver a message, and to which
recipients the delivery guarantee extends. In the event of
host or link failures, the ordering tree may be partitioned
into subtrees, each of which may continue to run TOM. A
vanished ON will be replaced by the next common node
in the destination set according to the label semantics. In
operational subgroups, the semantics of reliable delivery is
preserved for all multicast operations. Failure and recovery
events must be made known to all operational hosts in an
ordered fashion. Partitioned subbranches of the ordering tree
may rejoin as soon as communication paths between them
are reestablished. A link failure is detected when a host
fails to probe a neighbor node on the tree before expiration
of a local timer. A host failure is detected when a host with
a pending queue of messages does not receive an expected
message within a timeout period.

4  Taxonomy  and  Performance  Comparison 

We classify predominant ordering paradigms using reliable
broadcast or multicast into two main classes, as depicted
in Fig. 4: 1) geometry-independent protocols such
as symmetric, two-phase or centralized solutions, and 2)
geometry-dependent protocols such as ring-based and
tree-based solutions. Some schemes involve all hosts in
the ordering process in a decentralized way, using message
stability properties, whereas other solutions burden one or a
few hosts with the responsibility to order messages on behalf
of the hosts in a multicast group. The main problem
in the first case is to reach consensus among hosts on ordering
patterns; the problem in the second case is to elect
sequencer nodes. Our taxonomy contrasts with the distinction
between symmetric and token-site algorithms proposed by
Rodrigues et al. [26], which does not accommodate methods
that are neither symmetric nor based on token-passing,
such as tree-based ordering.

Figure 4. Taxonomy of ordered multicast solutions.


We evaluate the processing load $X$ at involved hosts
and the message overhead $M$ required to successfully multicast
a message in order from a source to all receivers.
We assume IP multicast as the dissemination model for all
schemes, although all schemes except TOM have been proposed
in broadcast systems. The goal of this comparison
is not an elaborate modeling of the many possible nuances
and optimizations of ordering schemes in conjunction with
reliable multicast, but rather a plain comparison of the fundamental
working structure of ordering solutions. To this
end we do not include loss probabilities and assume that
all schemes consistently use sender-initiated or receiver-initiated
error recovery [9]. Sender-initiated models place
the burden for processing acknowledgments and requests
for corrupt or lost packets on the transmission source, as
opposed to receiver-initiated solutions, where retransmissions
are handled in local groups among receivers and sources
are contacted only in cases of unrecoverable packet loss.
Receiver-initiated protocols achieve better scalability because
a source is likely only contacted in case of packet
loss.

The notation used is as follows: $s$ is the number of
sources transmitting a message $m$ destined to the same
receiver(s) at a given time (each sender is assumed to be a
receiver); $r$ is the number of receivers of $m$ in the receiver
set $Rec(m)$; $X_f$ is the time to feed a packet from a higher
protocol layer; $X_p$ is the time to process the transmission
of a packet (including retransmissions); $X_\#$ is the time to
process a sequence number check; $Y_p$ is the time to process
a newly received packet; $Y_f$ is the time to deliver a packet
to an end process; $X^w$ is the processing overhead per message
in protocol $w$ (for example, $w = S$ for symmetric or
$w = 2P$ for two-phase ordering).
Source nodes are denoted as SN, ordering nodes as ON, and
receiver nodes as RN (the detailed node semantics may vary
among protocols). $M$ is the number of transmissions needed
for all receivers to receive a message in order.

4.1  Geometry-Independent  Protocols

Reliable broadcast solutions are largely designed for
fault-tolerant, asynchronous distributed systems. Such protocols
are geometry-independent, i.e., all hosts are assumed
to be fully connected with each other, and routing between
hosts does not presume any prearranged host geometry. We
subsume symmetric, two-phase and centralized solutions
under this paradigm. Centralized ordering could also be
classified as a star geometry, but the central node is typically
chosen ad hoc based on some election or token-passing
scheme among all nodes.

4.1.1  Symmetric  Ordering

In symmetric schemes (S) [6, 11, 27], all hosts partake in
the ordering process in a decentralized way, analogous to
a voting process, using message stability properties. SN
disseminate messages reliably to all hosts, which assign a
timestamp to each message and place it in a pending buffer;
for each message m, participant hosts (SN and RN) agree
on a unique order number using timestamp information by
running a consensus protocol; a message with an assigned
order number is shifted to the delivery queue and delivered
to end processes in the globally binding order. Thus the
number of messages to be exchanged is a function of the
number of hosts involved in the ordering process. With
X_c denoting the extra cost for the consensus protocol, the
expected overhead of a generic symmetric protocol at SN
and RN is

X^{S}_{SN} = X_f + r X_p                                (1)

X^{S}_{RN} = s (Y_p + X_# + r X_c + Y_f)

With broadcast communication, a source node sends a
message to r - 1 receivers, which in turn send r - 1
messages to agree on the final sequence number, i.e., M_bc =
s((r - 1) + r(r - 1)), that is O(s r^2) for s sources. With
multicast and r < n receivers, M = s(1 + 2r), that is, one
multicast message to all receivers, one multicast from each of
the r receivers to the others, and one timestamp sweep from
all receivers to the source. Protocols with fault-tolerance
measures may incur significantly higher cost [27].
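
The two message counts above can be evaluated directly. The
following is a minimal sketch; the function names are
illustrative and not part of any protocol specification.

```python
def symmetric_broadcast_cost(s: int, r: int) -> int:
    """M_bc = s((r - 1) + r(r - 1)).

    Each of s sources sends to r - 1 receivers, after which r hosts each
    send r - 1 messages to agree on the final sequence number.
    """
    return s * ((r - 1) + r * (r - 1))


def symmetric_multicast_cost(s: int, r: int) -> int:
    """M = s(1 + 2r).

    Per source: one data multicast, one multicast from each of the r
    receivers, and r timestamp replies back to the source.
    """
    return s * (1 + 2 * r)


if __name__ == "__main__":
    for r in (10, 100, 1000):
        print(r, symmetric_broadcast_cost(1, r), symmetric_multicast_cost(1, r))
```

The broadcast cost grows quadratically in r, whereas the
multicast cost grows linearly.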


4.1.2  Two-phase  Ordering 

In 2-phase ordering (2P) [5], four communication steps are
required: a source sends a message m to a multicast group,
where each receiver assigns a priority number to the
message, places m as pending in its local queue, and returns the
priority number to the source. The source selects the
highest number and sends it to all receivers, which replace the
original number with the new one, tag the message as
deliverable, reorder the queue and deliver messages heading the
queue. The expected overhead at SN and RN is

X^{2P}_{SN} = X_f + r (Y_p + X_# + 2 X_p)               (2)

X^{2P}_{RN} = s (2 Y_p + X_# + X_p + Y_f)

If we assume r >= s, then X^{2P} = max(X^{2P}_{SN}, X^{2P}_{RN}) =
O(r). With multicast, one message from s sources to r
receivers, r control messages with priority numbers back to
each source, and one final control message multicast from
the source to the receiver set for each message are required,
i.e., M = s(1 + r).
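
The propose/select/finalize cycle described above can be
sketched as follows. This is a simplified, loss-free,
in-memory illustration and not the ISIS implementation; the
class and method names are hypothetical, and breaking
priority ties by message identifier is an assumption made
here so that the final order is unique.

```python
from dataclasses import dataclass, field


@dataclass
class Receiver:
    name: str
    clock: int = 0                               # highest priority seen so far
    queue: dict = field(default_factory=dict)    # msg_id -> [priority, deliverable]
    delivered: list = field(default_factory=list)

    def propose(self, msg_id: str) -> int:
        """Phase 1: assign a tentative priority and hold the message as pending."""
        self.clock += 1
        self.queue[msg_id] = [self.clock, False]
        return self.clock

    def finalize(self, msg_id: str, final_priority: int) -> None:
        """Phase 2: adopt the source's priority, then deliver from the queue head."""
        self.clock = max(self.clock, final_priority)
        self.queue[msg_id] = [final_priority, True]
        while self.queue:
            # Queue head = smallest (priority, msg_id); msg_id breaks ties (assumption).
            head = min(self.queue, key=lambda m: (self.queue[m][0], m))
            if not self.queue[head][1]:
                break                            # head is still pending, so wait
            self.delivered.append(head)
            del self.queue[head]


def two_phase_multicast(msg_id: str, receivers: list) -> int:
    """Source side: collect all proposals, select the highest, announce it."""
    final_priority = max(rcv.propose(msg_id) for rcv in receivers)
    for rcv in receivers:
        rcv.finalize(msg_id, final_priority)
    return final_priority


if __name__ == "__main__":
    r1, r2 = Receiver("r1"), Receiver("r2")
    # Messages "a" and "b" arrive in opposite orders at the two receivers.
    pa = [r1.propose("a")]
    pb = [r2.propose("b")]
    pb.append(r1.propose("b"))
    pa.append(r2.propose("a"))
    for rcv in (r1, r2):
        rcv.finalize("b", max(pb))
        rcv.finalize("a", max(pa))
    print(r1.delivered, r2.delivered)
```

Both receivers deliver the two messages in the same order
even though the messages arrived in different orders,
because a pending message at the head of the queue blocks
delivery until its final priority is known.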

4.1.3  Centralized  Ordering 

In centralized ordering (C) [3, 7, 21], a source SN transmits
a message m to a sequencer host, which assigns a unique
number to m and forwards it to the receiver set Rec(m),
where it is ultimately delivered to end processes in the order
prescribed by sequence numbers. The sequencer role may
rotate among hosts. The expected overhead at SN, ON, and
RN is

X^{C}_{SN} = X_f + X_p                                  (3)

X^{C}_{ON} = s (Y_p + X_# + r X_p)

X^{C}_{RN} = s (Y_p + Y_f)

Hence X^{C} = O(sr), and M = s + r, consisting of s
messages from sources to ON, and one multicast per message
from ON to all receivers. If SN = ON, we spare one step.
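
The division of labor between ON and RN can be illustrated
with a short sketch. It assumes a single fixed sequencer and
omits retransmission and sequencer rotation; the class names
are hypothetical.

```python
import heapq
from itertools import count


class Sequencer:
    """Ordering node (ON): stamps each message with the next sequence number."""

    def __init__(self):
        self._next = count(1)

    def stamp(self, payload):
        return next(self._next), payload


class OrderedReceiver:
    """Receiver node (RN): holds back out-of-order messages until gaps close."""

    def __init__(self):
        self.expected = 1       # next sequence number to deliver
        self.holdback = []      # min-heap of (seq, payload)
        self.delivered = []

    def receive(self, seq, payload):
        if seq >= self.expected:
            heapq.heappush(self.holdback, (seq, payload))
        while self.holdback and self.holdback[0][0] <= self.expected:
            head_seq, ready = heapq.heappop(self.holdback)
            if head_seq == self.expected:    # stale duplicates are dropped
                self.delivered.append(ready)
                self.expected += 1


if __name__ == "__main__":
    on = Sequencer()
    stamped = [on.stamp(p) for p in ("x", "y", "z")]
    rn = OrderedReceiver()
    for seq, payload in (stamped[2], stamped[0], stamped[1]):  # arrive out of order
        rn.receive(seq, payload)
    print(rn.delivered)
```

The receiver side only performs a sequence number check per
message, while the sequencer handles every message of every
source, which is where the load concentrates.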

4.2  Geometry-Dependent  Protocols 

Geometry-dependent  protocols  presume  a  specific  host 
topology  to  route  ordering  information. 

4.2.1  Ring-based  Ordering 

In ring-based ordering (R) [2, 25, 31], a logical ring
imposes a transmission path between hosts, where each host
needs to communicate only with its predecessor and successor
in the ring. To multicast a message, a host must possess
the token; the token contains requests for messages to be
resent and the highest sequence number of any message
broadcast on the ring; each host maintains an input buffer
containing pending messages with assigned sequence
numbers; on receipt of the token, the host completes processing
of messages in its buffer by adjusting sequence numbers,
resends messages requested in the token, updates the token
information and forwards the token; messages are sent to
end processes when marked as deliverable. Each SN, as
token site, assumes the role of ON. With X_tk indicating the
token transfer time, the expected overhead at SN and RN in
a single ring is

X^{R}_{SN} = X_f + X_p + r (Y_p + X_# + X_p) + X_tk     (4)

X^{R}_{RN} = s (Y_p + X_# + Y_f)

Hence X^{R} = O(r), if r > s, and the message overhead
is at best M = 2n/k, where 2n is the number of token
transfers required to accept k multicast messages in a ring
of n nodes [25]. With k = 1, s sources, and despite r < n
receivers, M = 2sn.
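
The role of the token as carrier of the highest sequence
number can be sketched as below. This is a deliberately
reduced illustration: it models a single token rotation per
round, omits retransmission requests and the acceptance
rotation counted in the 2n figure above, and uses a
hypothetical function name.

```python
from collections import deque


def circulate_token(pending, rounds=1):
    """Pass the token around the ring and sequence buffered messages.

    pending: list (in ring order) of deques holding each host's unsent messages.
    Returns (ordered, transfers): the sequenced messages as
    (seq, host_index, message) tuples and the number of token transfers used.
    """
    token_seq = 0        # highest sequence number carried in the token
    ordered = []
    transfers = 0
    for _ in range(rounds):
        for host, queue in enumerate(pending):
            while queue:                      # only the token holder may send
                token_seq += 1
                ordered.append((token_seq, host, queue.popleft()))
            transfers += 1                    # forward the token to the successor
    return ordered, transfers


if __name__ == "__main__":
    ring = [deque(["a"]), deque(), deque(["b", "c"])]
    ordered, transfers = circulate_token(ring)
    print(ordered, transfers)
```

Every host on the ring handles the token in each rotation,
including hosts that neither send nor receive the message,
which is why the cost depends on n rather than on r.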

4.2.2  Tree-based  Ordering 

For tree-based ordering (T), we compare the MP protocol
by Garcia-Molina and Spauster [13], and the metagroup
approach (MG) by Jia [16, 29] with TOM. Known tree-based
reliable multicast protocols [20, 24, 32] do not feature
ordering. Common to MP, MG and TOM is the idea
of distributing ordering responsibility and load across
several nodes on the tree. While MP and MG use group
membership information to cluster nodes for optimized message
delivery, TOM uses the end-to-end multicast topology.

The MP protocol has two work phases: 1) the transmission
from the source to a primary host, and 2) the transmission
from this host to the receivers. It builds a forest of
propagation trees, where hosts in the intersections of
multicast groups are chosen as hop nodes, i.e., roots of
subtrees. A message is first sent to these primary hosts, and
then propagated downward in the tree toward the receiver
hosts, being ordered on its propagation path, and finally
unicast to the receiver hosts. The MG protocol clusters
hosts from overlapping multicast groups into metagroups,
which do not overlap. Each group has a primary metagroup
(PM), and in each metagroup one member is assigned to
be manager. Metagroups are organized in a forest of
propagation trees, such that the PM of a group is the
ancestor of all other metagroups of the same group in the tree.
Messages destined to multicast group G are first sent to
PM(G), which propagates the messages along the tree to
all other metagroups that are subsets of G. The manager
of a metagroup broadcasts a message to other members in
its metagroup.

The drawback of MP and MG is the need to compute a
logical propagation or metagroup tree per source as an overlay
on the end-to-end geometry, which means that in order to
construct such a tree, the computation host(s) must know
the membership of all groups. This approach works only for
closed multicast and static groups, and amortizes itself only
for long-lived transmissions between hosts. The processing
overhead common to all tree-based schemes is


X^{T}_{SN} = X_f + X_p                                  (5)

X^{T}_{ON} = B (Y_p + X_# + X_p)

X^{T}_{RN} = Y_p + Y_f


Hence generally X^{T} = O(B), where B indicates the
branching factor of the tree. With multicast, M^{MP} =
s(1 + d) messages are required: one message from each of
the s sources to the primary destination in the subtree, and
one broadcast at each level of the subtree, where d is the
subtree depth [13]. MG has three work phases and requires
one message to PM(G), d messages to the managers of the
deepest metagroups at depth d in the subtree, and another
k messages to the members of the k metagroups containing
the target multicast group, i.e., M^{MG} = s(1 + d + k) [29].

TOM requires a multicast from s SNs to the receiver set,
p unicasts from the SN to the ON, where p is the
average path length, and one final multicast from ON to RN,
i.e., M^{TOM} = s(2 + p).


4.3  Results 


Table 1 summarizes expected message costs and delays.
Centralized and two-phase approaches incur only two and
three message exchange phases, respectively, but messaging
is concentrated on specific hosts in the session, which may
become a bottleneck or fail. Rings engage all hosts in a
session in the transmission process, even when a source and
multicast receiver group constitute only a small portion of
the entire session. Trees allow for selective engagement of
hosts on those subbranches or local groups that are
actually affected by the message processing.


Protocol            X           M

Symmetric           O(s r^2)    s(1 + 2r)
Two-Phase           O(r)        s(1 + r)
Centralized         O(sr)       s + r
Ring-based          O(r)        2sn
Tree-based: MP      O(B)        s(1 + d)
Tree-based: MG      O(B)        s(1 + d + k)
Tree-based: TOM     O(B)        s(2 + p)

Table 1. Average processing overhead X and
multicast message cost M.


We assume that there are as many sources as receivers,
r = n and s = 1. In the graph, we neglect the cost to
compute and maintain the propagation infrastructure, which
may be substantial for MG and MP in comparison to TOM,
which simply relies on a given acknowledgment tree. We
vary the session size between n = [1, 1000], with r = n/10
as the average size of a receiver multicast group. The MP
tree depth has been projected between d = [1, 8] for
simulations with n = 200 and average group size g_i = [5, 40] [13].
The tree depth for a metagroup tree has been projected
between d = [1, 5] for up to 40 metagroups with g = 50,
and an overlapping degree of 10. We also assume that each
source sends only one multicast message per transmission
cycle. Simulations for the Lorax protocol have indicated
that optimal ack trees are built when each node supports
at least B = 5 neighbors [20]. As a baseline comparison,
we hence choose the average depth of a subbranch in an MP
and MG tree as d = log_B r, where B = 5 is the average
node degree. The average path length for TOM is chosen
as p = h/2, because roughly half of the height h of the
tree needs to be traversed to converge on an ON. Note that a
message comparison provides a limited view of the relative
performance of the protocols, because parallelism in
message processing, the processing overhead at various nodes,
and the shape of the tree would need to be considered in
a more precise way. However, concentrating on M alone
is sufficient to express fundamental differences between the
approaches. Figure 5 plots the multicast message cost of the
various schemes under the given assumptions.
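
The comparison can be reproduced numerically from the
formulas in Table 1. The following is a rough sketch under
the stated parameters (s = 1, r = n/10, B = 5, d = log_B r,
p = h/2). Three choices are assumptions made here and are
not taken from the text: the tree height is set to
h = log_B n, depths are floored at 1, and the number of
metagroups k defaults to 1.

```python
import math


def message_costs(n: int, s: int = 1, B: int = 5, k: int = 1) -> dict:
    """Multicast message cost M per scheme, using the formulas of Table 1."""
    r = max(1, n // 10)               # average receiver group size r = n/10
    d = max(1.0, math.log(r, B))      # subtree depth d = log_B r (floor of 1 is an assumption)
    h = max(1.0, math.log(n, B))      # ASSUMPTION: tree height h = log_B n
    p = h / 2                         # average path length to an ON
    return {
        "Symmetric": s * (1 + 2 * r),
        "Two-Phase": s * (1 + r),
        "Centralized": s + r,
        "Ring-based": 2 * s * n,
        "Tree-based: MP": s * (1 + d),
        "Tree-based: MG": s * (1 + d + k),   # k is a hypothetical default
        "Tree-based: TOM": s * (2 + p),
    }


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(f"n = {n}")
        for scheme, cost in message_costs(n).items():
            print(f"  {scheme:<16} {cost:10.2f}")
```

Under these assumptions the tree-based costs grow
logarithmically with the session size, while the other
schemes grow linearly in r or n.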


[Figure: multicast message cost for various receiver sets]

Figure 5. Average message cost with multicast.

The results cover only one special scenario in which
the discussed protocols operate, namely genuine
multicast with one source transmitting. The multiple-source
case would reinforce that the throughput of a generic
tree-based protocol for ordered reliable multicast scales better
with the receiver set, owing to where sequencing is located
and executed. Symmetric methods exhibit the least scalability,
because all nodes are involved in processing messages from
all other nodes. If all nodes broadcast at the same time,
latency may be low, but a consensus protocol must be run.
Two-phase, centralized and ring solutions have similar
message overhead; however, rings may permit higher
concurrency, with the drawback that latency increases in large
sessions. The centralized ordering method is reasonable for
few hosts, but is a potential bottleneck and single point of
failure, particularly for large sessions. A logical hop between
hosts in MP and MG may be multiple hops across long
distances in the multicast routing tree, in contrast to TOM,
which operates under the assumption that the structure of the
ack tree mirrors the path information in the multicast routing
tree, rather than using separate propagation graphs. Comparing
the three tree-based methods, TOM performs as well as or
better than MG and MP, spreads the computational load of
ordering over multiple nodes in the tree, and is well suited to
dynamically changing multicast groups, rather than catering
to static membership and long-lived transmissions.

5  Related  Work 

Much work on total and causal ordering for multicast is
centered around fault tolerance or consistency issues in
distributed systems. Chandra and Toueg [6] have shown that
total order broadcast and consensus are equivalent
problems in asynchronous systems. Many protocols from this
field incur high overhead because they achieve fault tolerance
by introducing messaging to implement a failure detector and
a consensus mechanism.

Symmetric or decentralized approaches are based on the
total order solution by Lamport [19], where timestamps are
assigned to messages at broadcast time. The algorithm by
Chandra and Toueg [6] executes reliable broadcast in four
communication steps with a weak failure detector, and the
solution by Dolev et al. [11] is based on a majority
consensus protocol. Both schemes have O(n^2) message
complexity. Newer approaches such as the TO_multicast
protocol [17] or Scalatom [27] incur message complexity O(r^2).
Hybrid algorithms using distributed messaging in
conjunction with centralized sequencing [26] have been proposed
to achieve better scalability.

The ISIS system [5] implemented two-phase ordering
using vector clocks for logical time keeping, an explicit
membership model and layered microprotocols such as
CBCAST for causal ordering and ABCAST for total ordering.
Later versions of ISIS have adopted a ring-based approach.

Centralized reliable broadcast approaches have been
proposed in the RBP protocol [7] with a rotating token to
identify the sequencer, the Amoeba system [18], where the
kernel passes messages to a dedicated sequencing processor,
or the protocol by Navaratnam et al. [21], where a primary
manager orders messages and forwards them to a secondary
manager heading each local multicast group delivering to
application processes. Centralized reliable multicast
protocols such as URGC [1], MTP [3] or XTP [30] follow a
similar principle.

Protocols of the ring type, such as Totem [2], RMP [31]
or TPM [25], implement total ordering and feature resilience
toward network partitions and process failures. A related
approach is the token-bus protocol MLMO [33], which
ensures causal and total ordering in a hierarchical
infrastructure to connect internetworks.

The tree protocol by Ng [22] implements reliable
broadcast with source-ordered delivery to a single group
allowing up to k host failures, building a minimum spanning
tree between hosts. Total ordering is achieved using two
vectors of timestamps at each host to keep track of the
last message sent to and received from each neighbor. A
broadcast message with timestamp t is delivered to end
processes only if all messages with smaller timestamps have
already been received. The MP protocol by Garcia-Molina
and Spauster [13] solves single-source, multiple-source, and
multiple-group total ordering, including the case that
messages are addressed to different, overlapping groups.
However, delivery from the last intermediate host to final
destinations is unicast, causal ordering is not supported, and
reliable failure detection is required for the protocol to operate
resiliently. The improvement by Jia [16, 29] clusters hosts
into metagroups representing intersections of overlapping
groups and forms propagation graphs based on metagroup
relations to minimize duplicate deliveries, enhance
parallelism in delivery, and shorten propagation graphs.
Various tree-based reliable multicast protocols, TMTP [32],
RMTP [24], or Lorax [20], have been proposed in recent
years. TMTP and RMTP build a source-based tree for flow
and error control, and Lorax features positional labels in a
shared ack tree for concurrent multicast, but all three
protocols lack support for multi-source and multi-group ordering.
TOM is geared towards adding ordering to reliable
concurrent multicast as in Lorax, but it could also be deployed in
TMTP with domain managers, and in RMTP with
designated receivers as intermediate ordering nodes.

6  Conclusion 

This paper makes the case for adding ordering services
to tree-based concurrent reliable multicast, based on the
notion that ordered delivery of multimedia data is essential to
a growing number of Internet applications supporting
telepresence and near-synchronous information sharing.
Looking at reliable multicasting for such applications, we
observed that ordering services have not been considered as
an integrated component in data dissemination. The TOM
protocol stands in contrast to previous reliable broadcast
solutions tailored to local area networks, where ordering
was performed assuming symmetric communication,
centralized, ring-based or propagation graph schemes.

The TOM protocol is a first solution working directly on
reliable multicast trees, using staggered ordering of
messages on their paths from sources to the receivers.
Workload is hence distributed, the infrastructure used for
ordering is cohesive with the one for reliability provision, and the
addition of address labels yields efficient ordering for
multiple groups and subgroups. In contrast to other prominent
solutions, TOM does not require computation of separate
graphs for propagating ordering information. It implements
ordering in a diffusing computation, where messages are
ordered on their delivery paths from sources to receivers,
and each node deals only with its children and parent node
instead of the entire multicast group. We also proposed a
taxonomy for ordering schemes integrating reliable
broadcast and multicast solutions. A simple performance
comparison showed that ordering in trees surpasses contending
solutions in terms of scalability, efficiency and practicality.
Extensions of this work include an analysis of host
processing costs with packet loss probabilities and a closer look at
resiliency issues.